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Abstract 

Various statistical methods important for genetic analysis are considered and 
developed. Namely, we concentrate on the multifactor dimensionality reduction, 
logic regression, random forests and stochastic gradient boosting. These methods 
and their new modifications, e.g., the MDR method with "independent rule", are 
used to study the risk of complex diseases such as cardiovascular ones. The roles of 
certain combinations of single nucleotide polymorphisms and external risk factors 
are examined. To perform the data analysis concerning the ischemic heart disease 
and myocardial infarction the supercomputer SKIF " Chebyshev" of the Lomonosov 
Moscow State University was employed. 
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I Introduction 

The detection of genetic susceptibility to complex diseases (such as cardiovascular, on- 
cological ones etc.) has recently drawn much attention in many leading research cen- 
ters, see, e.g., [6] and [13]. According to the forecast of the World Health Organization 
(www.who.int), in 2030 the deaths related to cardiovascular diseases will exceed 23 mil- 
lions (this year about 17 millions), the oncological diseases will take the lives of more than 

II millions of our planet inhabitants and at least 2 millions of people will be the victims 
of the diabetes. Thus this research domain is important since one expects to provide for 
each person the prophylactic measures and medical treatment taking into account his/her 
genetic peculiarities which increase the risk of some diseases and protect from the oth- 
ers, see, e.g., [28]. Individual's DNA variations are typically described in terms of single 
nucleotide polymorphisms (SNP), i.e. the fragments of genetic code where a nucleotide 
change is possible. For more details see, e.g., |16]. The first examples of genetically based 
diseases (e.g., sicklemia) were related with a single mutation. Contrariwise many hard 
diseases such as diabetes, Alzheimer's disease and others have a complex character as they 
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can be provoked by mutations in different parts of the DNA code which are responsible 
for the formation of certain types of proteins. Quite a number of recent studies (see, e.g., 
[E], IH] and |10]) support the paradigm that the increasing risks of complex diseases 
can be explained by combinations of certain SNP whereas separate mutations have no 
dangerous effects. 

Thereupon it should be mentioned that there existed a longstanding demand for statis- 
tical analysis of biological and medical data. However, only in the first part of the 20th cen- 
tury, due to the classical contributions by K.Pirson, R.Fisher, H.Cramer, A.Kolmogorov, 
N.Smirnov, A.Wald and other prominent statisticians, the essential progress was achieved 
both in theory and applications. The methods developed were sufficient, e.g., for inves- 
tigation of the efficiency of new medicaments. The situation has changed radically at 
the beginning of the 2000 's when the laboratory methods of DNA analysis provided the 
data related to the personal human code structure. The achievements in decoding of 
the human genome have led to formation of vast data bases in the frameworks of the 
International Research projects, see, e.g., GAW16 [T7] and HapMap [2U]. Note also that 
software engineering plays an important role in such studies, see, e.g., [23] and ^5]. The 
cost of genomic analysis has fallen considerably in the last 10 years, allowing to collect 
large volumes of genetic data for genetic mapping of complex diseases. However statistical 
problems arising here require new methods of inference rather than classical ones. Indeed, 
the modern statistical models involve huge number of variables, parameters, hypotheses 
etc., while the sample size is usually moderate (several hundred or sometimes several 
thousand of observations, see, e.g., [M]). The sample design is limited both by costs of 
analysis which are still high and by difficulties due to the sample selection. In particular, 
the ethnic homogeneity should be taken into account, as well as the influence of external 
risk factors such as obesity, smoking etc. 

To perform reliable statistical inference, it is necessary to apply new powerful tools de- 
veloped in high-dimensional statistics, artificial intelligence, information retrieval, econo- 
metrics etc. Some of them have been adapted and further generalized in numerous papers 
by biostatisticians. Among the most important SNP analysis methods are the multi- 
factor dimensionality reduction (MDR), logic regression (LR), random forests (RF) and 
stochastic gradient boosting (SGB). All approaches based on these methods do not im- 
pose any strong restrictions on the dependence structure of variables under considera- 
tion (apart from independence and identical distribution of observations within certain 
groups). Thus a broad class of statistical models is defined and the model providing the 
best out-of-sample fit is selected. 

If one deals with too many parameters, overfitting is likely to happen, i.e. the esti- 
mated parameters depend too much on the given sample. As a result the constructed 
estimators give poor prediction on new data. On the other hand, application of a very 
sophisticated model may not capture the studied dependence structure of various factors 
efficiently. However the trade-off between the model's complexity and its predictive power 
allows to perform reliable statistical inference via new model validation techniques (see, 
e.g., [2] and [SO])- The main tool of model selection is the cross-validation, see, e.g., 
Its idea is to estimate parameters by involving only a part of the sample {training sample) 
and afterwards use the remaining observations {test sample) to test the predictive power 
of the obtained estimates. Then an average over several realizations of randomly chosen 
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training and test samples is taken, see [2T] . 

There are two closely connected research directions in genomic statistics. The first one 
is aimed at the disease risk estimation when the genetic portrait of a person is known (in 
turn this problem involves estimation of disease probability and classification of genetic 
data into high and low risk domains, see, e.g., [3D]). The second trend is to identify 
relevant combinations of SNPs having the most significant pathogenic (or, in other way, 
protective) influence. Both directions are presented in this paper. Moreover, the authors 
propose further development of various statistical methods and apply them to study of 
the risks of cardiovascular diseases. For this purpose the new software concerning the 
employment of the mentioned statistical methods was designed and used. 

Due to high-dimensionality of data many numerical procedures based on the above 
mentioned statistical methods are very time consuming. The authors are grateful to the 
Chancellor of the Lomonosov Moscow State University (MSU) Professor V.A.Sadovnichy 
and to the Deputy Director of the MSU Research Computing Center Professor V. V. Voevo- 
din for the opportunity to use the supercomputer SKIF MSU "Chebyshev" . 

This investigation was started in the framework of the project headed by Professor 
V.A.Tkachuk, the Dean of the Faculty of the Fundamental Medicine of the MSU. An 
overview of preliminary results of the work was presented at the International conference 
"Postgenomic methods of analysis in biology, and laboratory and clinical medicine" in 
the talk by Professor A.V.Bulinski (see [9] and [TU]). 

2 Methods 

We start with some definitions. Let be the the number of patients in the sample and 
the vector = [Xl, . . . , X^) consist of genetic (SNP) and external risk factors of j-th 
individual, j = 1, ...,N. Here n is the total number of factors, and X^! is the value of i-th 
variable (characterizing SNP or external factor) of j-th individual. These variables are 
also called explanatory variables or predictors. If stands for an SNP, we set 

{0, no mutation in i-th SNP, 
1, heterozygous mutation, (1) 
2, homozygous mutation. 

We assume that the external risk factors also take no more than three values, denoted 
by 0, 1 and 2. For example, we can specify a presence or an absence of obesity (or 
hypercholesterolemia etc.) by values 1 and respectively. If the external factor takes 
more values (e.g., blood pressure), we can divide individuals into three groups according 
to its values. 

Further on Xl, . . . , X^^ stand for genetic data and X^_^-^, . . . , X^ for external risk 
factors. Let a binary variable {response variable) be equal to 1 for a case, i.e. whenever 
j-th individual is diseased, and to —1 otherwise (that is for a control). Set 

e = • • • , e^) where ^ = Y^), j = l,...,N. (2) 

Suppose . . . are i.i.d. discrete random vectors having the same law as a vector 
{X,Y) and independent of this vector. Assume that X = (Xi,...,X„). All random 
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vectors (and random variables) are considered on a probability space {Q, J^,P), E denotes 
the integration w.r.t. P. 

The main problem is to find a function in genetic and external risk factors describing 
the phenotype (that is the individual healthy or sick) in the best way. 



2.1 Prediction algorithms 

Let X := {0, 1,2}" denote the space of all possible values of explanatory variables. Any 
function f : X ^ {—1,1} will be called a theoretical prediction function. Define the 
balanced or normalized prediction error for the theoretical prediction function / as 

Errif):=E\Y-fiX)\^PiY) 

where the penalty function : { — 1,1} — t- M+. Obviously, 

Errif) = 2^(-l)P(/(X) = 1,Y = -1) + 2^(1)P(/(X) = -l,Y = l). (3) 

Clearly Err{f) depends also on the law of {X,Y). Following jl5] and |18] we put 

^(y) = ^p^^, y€{-l,l}, (4) 

the trivial cases P(y = — 1) = and P{Y = 1) = are excluded. Then 

Errif) = ip(/(X) = l|r = -1) + lp(/(X) = -1\Y = 1). (5) 

For a balanced sample considered in [36], P{Y = —1) = P{Y = 1) = 1/2 and Err{f) = 
E\Y — f{X)\/2 is equal to the classification error P(Y ^ f{X)). 

The reason to consider this weighted scheme is that a misclassification in a more rare 
class should be taken into account with a greater weight. Otherwise, if the probability 
of disease P{Y = 1) is small, then the trivial function f{x) = —1 may have the least 
prediction error. The approach to calculation of the prediction error based on penalty 
functions is not the only one possible. Nevertheless Velez et al. [15] showed that for 
models with high computational costs it outperforms substantially other methods such as 
over- and undersampling. 

It is easy to prove that the optimal theoretical prediction function minimizing the 
balanced prediction error is given by 

/.(.)=/ 1- pW>P(^'=1). ,6) 
1—1, otherwise. 

where 

p{x) = P{Y = 1\X = x), xeX. (7) 

Then each multilocus genotype (with added external risk factors) x & X is classified as 
high-risk if f*{x) = 1 or low-risk if f*{x) = —1. 

Since p{x) and P{Y = 1) are unknown, the immediate application of is not possible. 
Thus we try to find an approximation of unknown function /* using a prediction algorithm 
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that is a function fpA = fpA{x,^{S)) with values in { — 1, 1} (recall that Y E {—1, 1} a.s.) 
which depends on x E X and the sample 

as) = {e, J e S} where Sc{l,...,N}. (8) 

The simplest way is to employ formula (Q with p{x) and P{Y = 1) replaced by their 
statistical estimates. For example introduce 



j:^^,i{y^ = i,x^ = x} 



and take 



PsiY 



X e X, 



1} 



(9) 



(10) 



where I{A} stands for the indicator of an event A and jl-D denotes the cardinality of a 
finite set D. 

Along with (fTOl) we will consider 



Ps{Y = l\X e C) 



C CX. 



fill 



Thus ([9]) is a special case of (fTTj) for C = {x} with x E X. Note that more difficult way 
is to search for the estimators of /* using several subsamples of ^. 

Assume that we constructed a prediction algorithm fp^. Then taking in mind ([5]) set 



Err{fpA{;aS))) = l Yl P (/pa(X, ^(5)) ^ y|F = y) . 



(12) 



As a law of {X,Y) is unknown one can only construct an estimate Err{fpA{-,aS))) 
of Err{fpji{-,C,{S))). In Section 3 we use the estimated prediction error of a prediction 
algorithm fpA which is based on i^'-fold cross-validation and has the form 



Err 



K 



, K E i{fPA{x^.i{Sk))i^y.Y^ =y] 



2 ^ K 

y6{-l,l} fc=l 



where 



Sk 



{k-l) 



N' 
K 



+ 1, 



k 



N' 
K 



E HY^ = y} 



I{k < K} + NI{k = K}} , 



(13) 



(14) 



Sk = {1, . . . , N} \ Sk and [a] is the integer part of a G M. 

A very important problem is to make sure that the prediction algorithm fpA gives 
statistically reliable results. The quality of an algorithm is determined by its prediction 
error (fT2|) which is unknown and therefore the inference is based on consistent estimates of 
this error. Clearly the high quality of an algorithm means that it captures the dependence 
between predictors and response variables, so the error is made more rarely than it would 
be if these variables were independent. Consider a null hypothesis Hq that X and Y are 
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independent. If they are in fact dependent, then for any significant prediction algorithm 
fpA an appropriate test procedure involving fpA should reject Hq at the approximate 
significance level a, e.g., 5%. Intuitively, this shows that results of the algorithm could 
not be obtained by chance. For such a procedure, we take a permutation test (see [18]). 
Its idea is as follows. 

Permutation test for a given statistic L{C,) (we consider L{C,) = Err{fpA{-,C,)) is done 
by the following steps. 

1. Generate B independent random vectors (vrj, . . . , tt^), I ^ b < B, with the uniform 
distribution over all permutations n^r of 1, . . . , A^. 

2. Compute ErrK,b = Err K{fpA,ib)i ^ < B, with 

6=((x\F-^),...,(xM^-^)). 

3. Find the Monte Carlo p-value (see, e.g., [27, p. 63]): 

p = F{E?rK{fpA,0) (15) 

where F = F{z) is the empirical cumulative distribution function (c.d.f.) defined 
by the relation 

I ^ 

F{z) = -J2^{ErrK,b<z}, z eR. 

6=1 

4. If p < a, reject Hq, otherwise not. 

According to [IH], one ideally has to use all permutations belonging to Hat but this 
is impractical in view of computational costs. Thus the Monte Carlo approximations for 
the true p-value p = F[ErrK{fpA,0) employed, here F is the c.d.f. of Err^fi- The 
upper bound for |p — p| is 1/2a/B (see [I^). This could be used to determine the number 
B of simulations for a desired accuracy. 

Note also that if the estimate of the error function for the algorithm fpA{-,0 is asymp- 
totically optimal, i.e. converges in probability to the error of the optimal prediction func- 
tion /* as — oo (,^ depends on A^), then the rule of thumb is to suspect overfitting if 
Err^ifpAyO is close to 1/2, which is a probability limit of this error under Hq as N ^ oo. 

We use complementary approaches to analyze dataset related to complex diseases. 
Each approach (MDR, LR and machine learning) is characterized by its own way of con- 
structing prediction algorithms. For each method one or several prediction algorithms 
admitting the least estimated prediction error are found (a typical situation is that there 
are several ones with almost the same estimated prediction error). These prediction al- 
gorithms provide a way to determine the domains where the disease risk is high or low 
(depending on the value of the corresponding prediction function). It is also possible to se- 
lect combinations of SNPs and external risk factors whose presence influences the liability 
to disease to a great extent. Some methods allow to present such combinations immedi- 
ately. Others, which employ more complicated forms of dependence between explanatory 
and response variables, need further analysis based on modifications of permutation tests. 

Now we pass to the description of various statistical methods and their applications 
to the cardiovascular risk detection. 
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2.2 Multifactor dimensionality reduction 

Ritchie et al. [36j introduced multifactor dimensionality reduction (MDR) as a new 
method of analyzing gene-gene and gene-environment interactions. Rather soon the 
method became very popular. Since the first publication more than 200 papers applying 
MDR in genetic studies were written (see, e.g., references in [13]). 

MDR is a flexible non-parametric method not depending on a particular inheritance 
model. We give a rigorous description of the method following ideas of [36] and |15]. As 
mentioned earlier, the probability p{x) introduced in (I7j) is unknown. To find its estimate 
one can apply maximum likelihood approach assuming that the random variable I{Y = 1} 
conditionally on X = x has a Bernoulli distribution with unknown parameter p{x). Then 
we come to ([9|). 

A direct calculation of estimate in (Q with exhaustive search over all possible values 
of X is highly inefficient, since the number of different values of x grows exponentially with 
number of risk factors. Moreover, such a search leads to overfitting. Instead, it is usually 
supposed that p{x) non-trivially depends not on all, but certain variables x,. That is, 
there exist I e N, I < n, and {kl, . . . , k^), where 1 < kl < . . . < k^ < n, such that for 
each X = (xi, . . . , Xn) € X, the following relation holds: 

p(x) = P{Y = l\Xk* = Xfc., ...,Xfc; = Xfc;). (16) 

In other words only few factors influence the disease and the others can be neglected. A 
minimal combination of factors (X^*, . . . ,Xk*) in formula (fT6]) is called the most signifi- 
cant. Clearly it is the most signiflcant combination which has the least prediction error. 
Indeed, if we consider any other combination of pairwise different indices ki, . . . ,kr and 

set 

' 1, P(r = i|x,, =Xfc,,...,Xfc, = Xfcj>P(r = i), 



— 1, otherwise, 



fki,...,kr{^) — 
then we obviously have 

Err (/fc*,...,,;) < Err{h,_k^) (17) 

where Err{f) is calculated according to (jSj). 

To choose the most signiflcant combination, exhaustive search over all possible combi- 
nations of factors is apphed. For each {ki, . . . , kr} C {1, . . . , n} and any x E X consider 

Cki,...,kr{x) = {u = (mi, . . . ,Un) e X : Uk, = Xk,, i = l,...,r} 

and for S appearing in (jSj) deflne a prediction algorithm (cf. (Q) by 



fki,...,kr{x,^{S)) :-- 



1, Ps{Y = 1\X G Ck,,...,kAx)) > Ps{Y = 1), 
-1, otherwise, 



here we use formulas (ITO!) and (fTTl) . It is easy to show that ErrK{fki,...,kr^O ~^ Err{fk^^...^kr) 
in probability as — )■ oo (^ depends on A^). Consequently, (fT7j) implies that, for any 
£ > and all A^ large enough, with probability close to 1 one has 

ErrKifki...,k;,0 < ErrKifk,,...,kr,0 
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Hence, it is natural to pick one or a few combinations of factors with the smallest empirical 
prediction errors as an approximation for the most significant combination. 

The last step in MDR is to determine statistical significance of the results. Here we 
test a null hypothesis of independence between X and Y i.e. between risk factors X and 
a disease Y. This can be done via the permutation test described in Section 2.1. 

MDR method with "independent rule". We propose multifactor dimensionality 
reduction with "independent rule" (MDRIR) method to improve the estimate of probabil- 
ity p{x). This approach is motivated by Park ^35j, who deals with classification of large 
array of binary data. The principal difficulty with employment of formula is that the 
number of observations in numerator and denominator of the formula might be small even 
for large N (see, e.g., [26]). This can lead to inaccurate estimates and finally to a wrong 
prediction algorithm. Moreover, for some samples the denominator of can equal zero. 

The Bayes formula implies that 

. ^ ^ p(x = x|r = i)P(y = i) 

P(x = x\Y = i)P(r = 1) + P(x = x\Y = -i)P(r = -1) ' ^ ' 

where the trivial cases P(Y = — 1) = and P(Y = 1) = are excluded. Substituting ( TT^ 
into (E]) we obtain the following expression for prediction function: 

' otherwise. 

As in standard MDR method described above, we will assume that formula f ll6p holds. 
It was proved in [35] that for a broad class of models (e.g., Bahadur model [3], logit model 
[13] ) the conditional probability P (X^^ = xi, . . . ,Xk^ = Xr\Y = y), where y = ±1, can 




be estimated in the following way: 

r 

Ps {X,, = xi, . . . ,Xfc^ = Xr \Y = y) = l[Ps (X,, = x,\Y = y), (21) 



1=1 



here (cf. ([II])) 

E I{Xi = X, Y^ = y} 
Ps (X.. =x\Y = y) = ^j^y.^y^ • (22) 

jes 

Combining ( 1TB]) . fl2U]) . f l^ and fl2^ we find the desired estimate of f*{x). 

A number of observations in numerator and denominator of fl22]) increases considerably 
comparing with (fl8]) . It allows to estimate the conditional probability more precisely 
whenever the estimate introduced in (I2T!) is reasonable. 

Thus, as opposed to standard MDR method, MDRIR uses alternative estimates of 
conditional probabilities. All other steps (prediction algorithm construction, prediction 
error calculation) remain the same. Let us mention that as far as we know this modifi- 
cation of MDR has not been applied before. It is based on a combination of the original 
MDR method |36l and the ideas of [35] . 
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2.3 Logic regression 

The logic regression (LR) was proposed in [38] • Further generahzations are given in [T6] . 
[25], [10], [39] and other works. LR is based on the classical binary logistic regression 
[22] and exhaustive search for relevant predictor combinations. The main difficulty is 
to organize the search efficiently. The LR method was applied to identification of the 
most significant SNP combinations in [1], [33] and [IQ]. Note that for genetic analysis 
it is convenient to use explanatory variables taking 3 values. Thus we employ ternary 
variables, whereas the authors of the above-mentioned papers employ binary ones. 

Let p{x) be the conditional probability of a disease defined in (^^. We suppose that 
trivial situations when p{x) G {0, 1} do not occur and omit them from the consideration. 
To estimate p{x) we pass now to the logistic transform 

q{x) = X{p{x)) (23) 

where X{z) = ln(z/(l — z)), z G (0,1), is the inverse logistic function. The logistic 
function itself equals to A{t) = (1 + e"*)^"*^, t G M. Note that we are going to estimate the 
unknown disease probability with the help on linear statistics with appropriately selected 
coefficients. Therefore it is natural to avoid restrictions on possible values of the function 
estimated. Thus the logistic transform is convenient, because p{x) G (0, 1) for x E X 
while q{x) can take all real values. 

Consider a class Q of all real- valued functions in ternary variables xi, . . . , x„. We call 
a model of the dependence between the disease and explanatory variables any subclass 
Meg. Set 

APs{Y = y) 

here PsiX = u) was introduced in ffTOl) . Define the normalized smoothed score function 

m as)) = ^Y1 <Pi-Y'KX^))i^iY\ as)) (24) 

where S is introduced in (JHl), (j){t) = log2(l + e*) for t G M, and h G Ai. In contrast to 
previous works our version of LR scheme involves normalization (cf. ([3])), i.e. taking the 
observations with weights dependent on the proportion of cases and controls in subsample 

as). 

An easy computation yields that argmin/i^^vi L{h,aS)) equals to 

argmax-^y | \nA{h{X')) ^ ^ + ln(l - A{h{X^))) ^ 
f^^M^Sj^^y ''2Ps{Y=l) 2Ps{Y = 

That is, minimizing the score function is equivalent to the search of normalized maximal 
likelihood estimate of q. Note that estimating the disease probability in this setup is 
closely connected with the problem of data classification, i.e. predicting the disease by 
the value of x G A". Recall that in standard classification problem instead of the score 
function flMl) one uses the following normalized estimate of the error probability 

mas)) = < oMl^^e(5))• 
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Figure 1: A tree T representing a function T{x) = (xi x X2) x (xs + 0:4) . 



In fact the optimal choice of h for these problems coincide if the underlying model M. 
is correctly specified (i.e. q E Ai), see [5j. However the usage of score function L has 
an important advantage over L because one has to evaluate the minimum of a smooth 
function. 

A wide and easy to handle class of models is obtained by taking functions linear in 
variables xi, . . . , x„ or in their products. In turn these functions admit a convenient rep- 
resentation by elementary polynomials. Recall that an elementary polynomial (EP) is a 
function T in ternary variables xi, . . . , x„ belonging to {0, 1, 2} which can be represented 
as a finite sum of products x^^ . . . x"" where Ui, . . . , m„ G Z_|_. The addition and multi- 
plication of ternary variables is considered by modulo 3. Any EP can be represented as 
a binary tre^ in which knots (vertices which are not leaves) contain either addition or 
multiplication sign, and each leaf corresponds to a variable. Figured] provides an example 
of a binary tree. Different trees may correspond to the same EP, thus this relation is not 
one-to-one. However, it does not influence our problem, so we regain the notation T for 
a tree. A finite set of trees F = (Ti, . . . , Tg) is called a forest. For a tree T, its complex- 
ity C{T) is the number of leaves. The complexity C{F) of a forest F is the maximal 
complexity of trees constituting F. 

It is clear that if g E Q then there exists s > 1 such that g has the following form: 



here /3o, ■ ■ ■ , /3s G and Ti, . . . , are EP. 

Let us say that function g belongs to a class Qr{s), where s,r G N, if there exist a 
decomposition f l25|) of g such that all trees Tj (i = 1, . . . , s) have complexity less or equal 
r. We identify a function g G Qr{s) with pair {F, (5) where F is the corresponding forest 
and f3 = (/3o, • • • , Ps) is the vector of coefficients in (^^. 

Minimization of L{h,^{S)) defined by over all functions h E M. C. Qr{.s) is done 
in two alternating steps. First we find the optimal value of /3 while F is fixed (which is the 
minimization of a smooth function in several variables) and then we search for the best 
F . Here one uses stochastic algorithms, since the number of such forests increase rapidly 

"'For the basic concepts of the graph theory see, e.g., [7]. 



s 




(25) 



10 



when the complexity r grows. For s G N, a forest F = (Ti, . . . , Tg) and a subsample C,{S) 
(see ([8])) consider a prediction algorithm /j^ setting 



A subgraph B of a tree T is called a branch if it is itself a binary tree (i.e. it can 
be obtained by selecting one vertex of T together with its offspring). Sum and product 
signs standing in a knot of a tree are called operations, thus * stands for sum or product. 
Following |38], call the tree T a neighbor of T if it is obtained from T via one and only 
one of the following transformations. 

1. Changing one variable to another in a leaf of the tree T {variable change). 

2. Replacing an operation in a knot of a tree T with another one, i.e. sum to product 
or vice versa {operator change). 

3. Changing a branch of two leaves to one of these leaves {deleting a leaf). 

4. Changing a leaf to a branch of two leaves, one of which contains the same variable 
as in initial leaf {splitting a leaf). 

5. Replacing a branch Bi * B2 with the branch Bi {branch pruning). 

6. Changing a branch 5 to a branch Xj * B {branch growing), here Xj IS Sb variable. 

Figure [2] depicts results of these operations applied to the tree T of Figure [H We say 
that forests F and F are neighbors if they can be written as F = {Ti,T2, . . . ,Ts} and 
F = {Ti, T2, . . . ,Ts} where Ti and Ti are neighbors. The neighborhood relation defines a 
finite connected graph on all forests of equal size s with complexity not exceeding r. To 
each vertex F of this graph we assign a number ^p{F). To find the global minimum of a 
function defined on a finite graph we employ the simulated annealing method (see, e.g., 
[in], [22] and [SZ])- This method constructs some specified Markov process which takes 
values in the graph vertices and converges with high probability to the global minimum of 
the function. To avoid stalling at a local minimal point the process is allowed to pass with 
some small probability to a point F having greater value of (f{F) than current one. We 
propose a new modification of this method in which the output is the forest corresponding 
to the minimal value of a function (p{F) over all (randomly) visited points. 




where h = {F, (3) and 




(26) 



Define also the normalized prediction error of a forest F = (Ti, . . . , T^) as 



lp{F)=E?rK{f[ni;0,0- 
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Figure 2: Neighbors of the tree T. 
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Figure 3: CART representing the prediction ((a;i = 2) and (3:3 = 2)) or (X4 > 0). 



2.4 Machine learning methods 

Let us describe (see, e.g., [13]) two among the most popular machine learning methods 
- random forests (RF) and stochastic gradient boosting (SGB). They belong to ensemble 
methods which combine multiple predictions from a certain base algorithm to obtain 
better predictive power (i.e. less estimated prediction error). We use classification and 
regression trees (CART) for a base learning algorithm because it showed good performance 
in a number of studies (see |21j). 

Classification tree T is a binary tree having the following structure. Any leaf of T 
contains either 1 or —1 and for any vertex P in T (including leaves) there exists a subset 

of the explanatory variable space X such that the following properties hold: 

1. Ap = X iiP is the root of T; 

2. if vertices P' and P" are children for P, then Api Apn = Ap and Apt (1 Ap// = 0. 

In particular, subsets corresponding to the leaves form the partition of X. To obtain a 
prediction of Y given a certain value x & X of the random vector X, one should go along 
the path which starts from the root and ends in some leaf turning at each parent vertex P 
to that child P' for which Api contains x. At the end of the x-specific path, one gets either 
1 or — 1 which serves as a prediction of Y. Figure [3] provides an example of a classification 
tree. Namely, the partition of X is formed by values of boolean functions standing in 
parent vertices. For each x starting from the root of the tree we calculate the value of a 
boolean function and move along the edge marked with the value obtained (1 or 0). The 
left child of the root corresponds to the subset {x E X : xi = 2}, while the right one to 
its complement in X. Next, the leftmost leaf stands for a subset {x E X : xi = 2, x^ = 2}, 
and if X falls in this subset, we predict that Y = 1; the rightmost leaf stands for a subset 
{x E X : xi < 2,X4 = 0}, and if X takes values in this subset, we predict Y = —1. 

Classification tree could be constructed via CART algorithm (if it is the case, we 
will call it CART). The algorithm proceeds iteratively. That is, on the l-th step of the 
algorithm (/ = 1,2,...), each element A of the current partition Ai {Ai = X) of the set 
X is divided into two disjoint parts 

A'^ {i,t) = {{xi, . . . , Xn) E A : Xi < t} and A~ {i,t) = { (xi , . . . , x„) G A : Xi > t} 
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minimizing the sum G{A'^(i, t)) + G{A (i, t)) over i = 1, . . . ,n and t G {0, 1}. Here the 
empirical Gini index 

G{C) = 2Ps{Y = 1\X eC){l- PsiY = 1\X e C)) 

with C C X and PsiX = l\X E C) (see ( |TT|) ) measures the heterogeneity of the subsample 
{j G S* : ^ C} w.r.t. response variable Y. Any uninformative partition with 

mm( G{A+ {t,t)) + G{A-{i,t))) > G{A), 
{i,t) 

is not made. 

The algorithm stops whenever a number of leaves D reaches some critical level which 
is chosen via some data- dependent criteria (see [21], page 308). For a subsample ^{S) of 
^, each CART defines a prediction algorithm 

D 

fix, as)) = J2 (^MS))nx G AMS))} (27) 

d=l 

where {yli(^(S')), . . . , Ad{C,{S))} is the partition of X corresponding to the leaves, 

1, ^{jeS: = i,x^eAMS))} > ^{jeS: Y^ = -i,x^eAMS))}; 



aaiaS)) - |_^'^ iihertise. 



RF is a non-parametric method of estimating conditional probability p{x). It was 
successfully applied to genetics data in a number of papers (see references in [i3|). It 
could be briefly described as follows (see chapter 15 in [2lj for details). Generate B boot- 
strap samples from the initial sample where one could choose B = max{ [A^ log A^] , 1000} 
according to [31]. For 6-th bootstrap sample {I < b < B) construct a CART prediction 
algorithm f^iXx {X x {— 1,1})^— ;■{ — 1,1} defined according to (!27|) and take 

B 



PRp(x,e(5)) = (i?-i5^/,(x,e(5)) + 1)/2 



6=1 

as an estimate of p{x). 

It is shown in [5J that generally RF method gives consistent estimates of p{x) only 
if the number of partitions used in CART grows slower than the sample size. A final 
prediction algorithm fKF{x,C,{S)) is constructed from the estimate PRF{x,aS)) similarly 
to (ED, i.e. 

' 1, pMx,aS))>Ps{Y = l), 
-1, otherwise. 



fRF{x,aS)) 



The distinctive features of this method are low computational costs and the ability to 
extract relevant predictors when the number of irrelevant ones is large (see [3]). 

SOB is another non-parametric method of estimating conditional probability p{x). 
This method is used in a number of procedures for studying genetics data (see, e.g., [47J). 
SOB method can be described as follows ([T5]). 



14 



1. Pass on the input of the algorithK0 initial parameters D, M and p,i] G (0, 1). 



2. Put m = 0, ^o{S) = ^{S) and 



2 P5(r = -1) 

3. Increase m by 1 and define 

Yl :-- 



1 + exp{2Y^ U-i{X^ , {US)}T=o)} 

Choose a random subset in ^m{S) = {{X^ ,Y^)}ji^s with [ri'\\S] elements. Construct 
CART prediction algorithm (with D leaves) J2d=i (^7 iuis)) I {x E A^{^^{S))} on 
the chosen subset. Compute weight coefficients 

V Y^ 

w^U(S)) = ^'^-^ _. , d=l,...,D, 

where the random set J = {j : G A^{C,m{S))}, and put 

D 

fm = fm-1 + py^wg'(^m('5'))/A^"(g,„(g)), 
d=l 

here p is the memory relaxation parameter. 
4. If m < M, go to Step 3, otherwise determine a final estimate 

PSGBix,^{S)) 



l + exp{-2fM{x,aS),{USm=i)y 

This algorithm is to be run for several times with different parameters D, M, p and rj. 
Then their optimal values could be chosen via cross-validation (see section 16.3.1 in [21]). 
Small values of r] (= 0.1, 0.05, 0.0225 etc.) help to get accurate estimates for relatively 
noisy data. 

Standard RF and SGB work poorly for unbalanced samples. One needs either to 
balance given datasets (as in [TT]) before these methods are applied or use special mod- 
ifications of RF ([5]) and SGB ([22]). To avoid overfitting, permutation test needs to be 
done. 

A common problem of all machine learning methods is a complicated functional form 
of the final probability estimate p{x,C,) (w.r.t. x). In genetic studies, one wants to pick up 
all relevant combinations of SNP and risk factors, based on a biological pathway causing 
the disease. Therefore, the final estimate p{x,C,) is to be analyzed. 



''This algorithm is not to be confused with prediction algorithms. 
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We describe one of the possible methods of such analysis within RF framework and 
called conditional variable importance measure (CVIM). One could determine CVIM for 
each predictor Xi in X and range all Xj in terms of this measure. Following [12], CVIM of 
predictor Xj given certain subvector Zi of X is calculated as follows (supposing Zj takes 
values Zii,.. . 

1. Construct a vector (/i, . . . , 1^), randomly permuting 1, . . . , X in each subset 

Aik = {j : = Zik}, k = 1,..., m{i). 

2. Generate B bootstrap samples = {{X^\ Y^''),j = 1, . . . , X) , 6 = 1, . . . , 5. For 
each of these samples, construct a classifier fb{x,^b) and calculate 

CVIM, = j^Y. = fbi^'^^b)} - ^ E = fbi^^'^^b)} 

where = {j G {1, . . . , X} : (X^ Y^) i 6}- 

3. Compute the final CVIM using the formula 

B 

CVIM = B-^ E CVIMfe. (28) 

6=1 

Any permutation (/i, . . . , /^v) in the CVIM algorithm destroys dependence between Xj and 
(y, Z_.j) where Z_i consists of all components of X which are not in Z^. At the same time 
it preserves initial empirical distribution of (Xj,Zj) calculated for the sample ^. After 
that the average loss of correctly classified Y is calculated. If it is relatively large w.r.t. 
CVIM of other predictors, then Xj plays important role in classification and vice versa. 

For Zj, one could take all components X^ [k ^ i) such that the hypothesis of the 
independence between X^ and Xj is not rejected at some significance level (e.g., 5%). 
Note also that CVIM-like algorithm could be used to range pairs of SNP and risk factors 
w.r.t. the level of association to the disease. This will be done elsewhere. 



3 Applications: risks of IHD and MI 

We employ here the various statistical methods described above to analyze the influence of 
genetic and external factors on risks of ischemic heart disease (IHD) and myocardial infarc- 
tion (MI) using the data for 454 individuals (333 cases, 121 controls) and 333 individuals 
(165 cases, 168 controls) respectively. These data contain values of seven SNPs (PAI-1, 
Gpla, GpIIIa, FXIII, FVII, IL-6, Cx37), as well as four external risk factors, namely, 
obesity (Ob), arterial hypertension (AH), smoking (Sm) and hypercholesterolemia (HC). 
The age of all individuals in case and control groups ranges from 35 to 55 years, which 
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reduces its influence on the risk analysis. For each of considered methods, K-fold cross- 
validation is used with K = Q. As shown in [31] and [H], the standard choice of partition 
number of cross-validation from 6 to 10 does not change the prediction error significantly. 
We take K = Q as the sample sizes do not exceed 500. The supercomputer SKIF MSU 
"Chebyshev" was involved to perform computations. All applied methods have prediction 
error less than 0.25, so predictions constructed have significant predictive power. Indeed, 
in [12] and [H] the interplay between genotype characteristics and MI development was 
also studied, with estimated prediction errors 0.30-0.40. Further on we write prediction 
error instead of estimated prediction error. 

3.1 MDR and MDRIR methods 

Ischemic heart disease 

Table [1] contains (estimated) prediction errors of the most significant combinations 
obtained by MDR analysis of ischemic heart disease data. At Figure H] a plot of empirical 
distribution function of prediction error is given when the disease is not linked with 
explanatory variables. We use here the simulated samples introduced in Section [2?T| 
with h = 1,. . . ,B where B = 100. One can see that out of these 100 simulations, the 
corresponding prediction error was not less than 0.42. Note that Monte Carlo p-value 
(fT5|) of all three combinations is less than 0.01 (since their prediction errors are much less 
than 0.42), which is usually considered as a good performance. 



Factors 


Prediction error 


Gpla, FXIII, AH, HC 


0.231 


Cx37, AH, HC 


0.238 


GpIa,Cx37, AH, HC 


0.241 



Table 1: The most significant combinations obtained by MDR analysis for IHD data. 

Table |5] contains the results of MDRIR method, which are similar to results of MDR 
method. However, it is worth mentioning that MDRIR method allows to identify addi- 
tional combinations with prediction error around 0.24. 



Factors 


Prediction error 


FXIII, FVII, AH, HC 


0.240 


FXIII, AH, HC 


0.242 


Gpla, Cx37, AH, HC 


0.247 



Table 2: The most significant combinations obtained by MDRIR analysis of IHD data. 

It follows from Tables [1] and |2] that hypertension and hypercholesterolemia are the 
most important external risk factors. Indeed, these two factors appear in every of 6 
combinations. 
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Figure 4: The empirical distribution function of the estimated error for the case when 
disease risk and predictors are independent (permutation test), IHD dataset. 

To perform a more precise analysis of influence of SNPs on IHD provoking we ana- 
lyze gene-gene interactions. We used two different strategies. Namely, we applied MDR 
method to a subgroup of individuals who are not subject to any of the external risk fac- 
tors (i.e. to non-smokers without obesity and without hypercholesterolemia, 51 cases and 
97 controls). Another strategy is to apply MDR method to the whole sample, but to 
take into account only genetic factors rather than all factors. Table |3] contains the most 
significant combinations of SNPs and their prediction errors. 



Method 


Genetic factors 


Prediction 
Error 


MDR on a subgroup of individuals 
who do not have any risk factors 


Gpla, Cx37 


0.281 


MDR method on the whole group 
taking into account only genetic factors 


Gpla, Cx37 


0.343 



Table 3: Comparison of the most significant SNP combinations obtained by two different 
ways of MDR analysis of IHD data. 

It turned out that both methods yield similar results. Combination of SNPs Gpla and 
Cx37 has the biggest influence on IHD. Prediction error is about 0.28-0.34, and smaller 
error corresponds to a risk-free sample. Moreover it follows from Tables [2] and |3] that 
prediction error signiflcantly dropped after additional exogenous factors were taken into 
account (the error is 0.247 if additional external factors are taken into account and 0.343 
if not). 

Thus based on ischemic heart disease data with the help of Tables [TH3] we can make 
the following conclusions. Combination of two SNPs (Gpla and Cx37) and two external 
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factors (hypertension and hypercholesterolemia) has the biggest influence on IHD. Also 
FXIII gives additional predictive power if AH and HC are taken into account. 

Myocardial infarction 

Prediction errors of the most significant combinations obtained by MDR analysis of MI 
data are presented in Table HI Figure [5] contains the plot of empirical c.d.f. of prediction 
error if disease is not linked with risk factors. This curve shows that for all 100 simulations 
of the estimated prediction error was not less that 0.38. Note that Monte Carlo p- value 
of all combinations is less than 0.01. 



Factors 


Prediction error 


GpIIIa, FXIII, Cx37, AH 


0.343 


GpIIIa, FXIII, FVII, Cx37 


0.347 


Cx37, Sm 


0.356 



Table 4: The most significant combinations obtained by MDR analysis of MI dataset. 





D 
D 




D 

D,3S 0,33 0,4 0,41 0,42 0,43 0,44 0,45 0,46 0,47 

Figure 5: The empirical c.d.f. of the estimated error for the case when disease risk and 
predictors are independent (permutation test), MI dataset. 

MDRIR analysis of the same dataset gives a clearer picture (see Table [5]). 

Apparently, combination of smoking and SNP Cx37 is the most significant. These 
two factors appear in all combinations in Table |5l Involving any additional factors only 
increases prediction error. 

The explicit form of the prediction algorithm based on Cx37 and Sm shows that these 
factors interact nonlinearly. Smoking as well as Cx37 homozygote leads to the disease. 
However wild-type allele can protect from consequences of smoking, because combination 
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Factors 


Prediction error 


Cx37, Sm 


0.351 


GpIIIa, Cx37, Sm 


0.353 


GpIIIa, Cx37, Sm, HC 


0.355 



Table 5: The most significant combinations obtained by MDRIR analysis of MI dataset. 

of smoking and Cx37 wild-type is a protective one (i.e. value of prediction algorithm of 
this combination is -1). 

3.2 Logic regression 

We performed several research procedures both for IHD and Ml data, with different 
restrictions imposed on the statistical model. To describe these models set 

{Xi, . . . , Xn) — ( i?i , . . . , Rk) 

where variables Z = {Zi, . . . , Z^) stand for SNP values (PAI-1, Gpla, Gpllla, FXlll, 
FVII, IL-6, Gx37 respectively) and R — . . . , denote external risk factors (Ob, 
AH, Sm, HG), m = 7. k = 4. Wc consider four different models in order to analyze both 
total influence of genetic and external factors and losses in predictive force appearing when 
some factors are excluded. In our applications we will take s = 3, as search over larger 
forests for samples with modest sizes can give very complicated and unreliable results. 

Model 1. We consider the class Ai (see Section 2.3) consisting of the functions h 
having a form 

s k 

h{Z,R) = (3o + J2WZ + 

v=l v=l 

where the coefficients /3i G M and Tj are polynomials identified with trees. In other words 
we require that external factors are present only in trees consisting of one variable. 
Model 2. Now we assume that any function h E M. has the representation 

s 

h{Z, R)^p^ + J2 WZ,, ...,Zm,Ri,..., Rk) (29) 

v=l 

where /3i G M and Tj are polynomials identified with trees. Thus we allow the interaction 
of genes and external factors in order to find significant gene-environment interactions. 
However we impose additional restrictions to avoid too complex combinations of external 
risk factors. We do not tackle here effects of interactions where several external factors 
are involved. Namely, we consider only the trees satisfying the following two conditions. 

1. If there is a leaf containing external factor variable then the root of that leaf contains 
product operator. 

2. Moreover, another branch growing from the same root is also a leaf and contains a 
genetic (SNP) variable. 
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Models 3 and 4 have additional restrictions that polynomials Ty{v = 1, . . . ,s) in 
( 129|) depend only on external factors and only on SNPs respectively. These models are 
considered to compare their results with ones obtained with all information taken into 
account, in order to demonstrate the importance of genetic (resp. external) data for risk 
analysis. 

Ischemic heart disease 

We have the following results. 



Model 


1 


2 


3 


4 


Prediction error 


0.19040 


0.20364 


0.22812 


0.33990 



Table 6: Results of LR for IHD dataset. 

Note that prediction error in Model 1 is only about 0.19. For the same model we 
performed also fast simulated annealing search of the optimal forest which is much more 
time-efficient, and a reasonable error of 0.23 was obtained. Model 3 application shows 
that external factors play an important role in IHD genesis, as classification based on 
external factors only gives the error less than 0.23, while usage of SNPs only (Model 4) 
lets the error grow to 0.34. 

Model 1 gave the minimal prediction error. For the optimal forest (Ti, . . . , -R4) the 
function h{Z, R) given before formula (126|) with S = {1, . . . , N} is provided by the ex- 
pression 

- O.597T1 - O.354T2 + O.52IT3 - 0.444i?i + 1.311i?2 - 0.146i?3 + 2.331i?4 - 0.226 (30) 
wher^ 

Ti = {Z^Z^ + ZqZj + Z2Z2 + Z3Zy)(ZiYZ3Zt, 
T2 = Zi{Z^Y{ZqZj + Z'j{Z4)'^Z2), T^ = Z2 + 2Z2{ZqYZt. 



The external factors 2 and 4 (i.e. AH and HC) are the most influential since the 
coefficients at them are the greatest ones (1.311 and 2.331). As was shown above, MDR 
yields the same conclusion. If the gene-environment interactions are allowed (Model 2), 
no considerable increase in predictive force has been detected. However we list the pairs 
of SNPs and external factors present in the best forest: Zj and i?2, Z^ and Z7 and 
i?4, Z5 and Ri. It is seen that Cx37 SNP is of substantial importance as it appears in 
combination with all risk factors except for smoking. 

As formula fl5U]) is hard to interpret, we select the most significant SNPs via a variant 
of permutation test. Consider a random rearrangement of the column with first SNP in 
IHD dataset. Calculate the prediction error using these new simulated data and the same 
function h as before. The analogous procedure is done for other columns (containing the 

^The sums and products are modulo 3. 
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values of other SNPs) and the errors found are given in Table [3 It is seen that the error 
increases considerably when the values of Gpla and Cx37 are permutated. The statement 
that they are the main sources of risk agrees with what was obtained above by MDR 
method. 



Prediciton error 
for Model 1 


Gpla 


Cx37 


IL-6 


PAI-1 


GpIIIa 


FXIII 


FVII 


0.19040 


0.26283 


0.25987 


0.22590 


0.21212 


0.20798 


0.20173 


0.19040 



Table 7: The SNP significance test for IHD in Model 1. 
Myocardial infarction 

For the MI dataset, under the same notations that above, the following results for our 
four models were obtained. 



Model 


1 


2 


3 


4 


Prediction error 


0.30526 


0.33058 


0.39057 


0.36455 



Table 8: Results of LR for IHD dataset. 

To comment the Table [H] we should first underline that external risk factors play less 
important role compared with IHD risk: if they are used without genetic information, the 
error increases by 0.09, see Models 1 and 3 (while the same increase for IHD was 0.03). 
The function h{Z, R) defined before formula fl2Bl) with 5* = {1, . . . , N} is is equal to 

-1.144Ti + 0.914r2 - 0.45r3 - 0.285i?i - 0.675i?2 + 0.828i?3 - 0.350i?4 - 0.055 

where 

Ti = ZiZ^i^Z^) , T2 = Zj, T3 = Z4 + Z3 + Zj + Zq. 

Thus the first tree has the greatest weight (coefficient equals -1.144), the second tree (i.e. 
Cx37 SNP) is on the second place, and external factors are less important. 

As for IHD we performed a permutation test to compare the significance of different 
SNPs. Its results are presented in Table M 



Prediciton error 
for Model 1 


Cx37 


GpIIIa 


IL-6 


FXIII 


FVII 


PAI-1 


Gpla 


0.30526 


0.44420 


0.35345 


0.33998 


0.32761 


0.32427 


0.31918 


0.30526 



Table 9: The SNP significance test for MI in Model 1. 

As seen from this table, the elimination of Cx37 SNP leads to a noticeable increase in 
the prediction error. This fact agrees with results obtained by MDR analysis of the same 
dataset. 
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3.3 Results obtained by RF and SGB methods 



The given datasets were unbalanced w.r.t. response variable and we first applied the re- 
sampling technique to them. That is, enlargement of the smaller of two groups case-control 
in the sample by additional bootstrap observations till the final proportion case: control 
would be 1:1. We also employed modifications of RF by [8] and SGB by [29j for unbal- 
anced samples, but those worked poorly for permutation tests and we do not give their 
results here. Note that due to the resampling techniques the following effect arise. Some 
observations in small groups (case or control) appear in the new sample more frequently 
than other ones. Therefore, we took the average over 1000 iterations. 

Ischemic heart disease 



Data 


RF 


SGB 


with SNP 


0.20/0.454 


0.134/0.473 


without SNP 


0.23/0.51 


0.261/0.503 



Table 10: Prediction error /prediction error in permutation test calculated via cross-validation 
for IHD dataset with employment of RF and SGB methods. 

Results of RF and SGB methods are given in Table [TOl It shows that RF and SGB 
methods give statistically reliable results (prediction error in the permutation test is close 
to 50%). Moreover, additional SNP information improves predicting ability on 11% and 
13% (SGB). It seems that SGB method is better fitted to IHD data than RF. 

Computing CVIM for each Xj, we constructed Zi as follows. We included in Zi all 
predictors X^, j ^ i, for which x^-criteria rejected hypothesis of independence between 
Xj and Xj at 5% significance level. Since the genetic information has second order effect 
on prediction of Y comparing to the risk factors, we ran the program 1000 times and then 
took the average CVIM to get a reliable estimate. An error over different runs of the 
program was around 0.01. The results are given in Table 11. 



AH 


HC 


Cx37 


Ob 


FXIII 


Sm 


Gpla 


FVII 


PAI-1 


GpIIIa 


IL-6 


8.9 


5.3 


5.1 


0.56 


0.53 


0.11 


0.1 


0.07 


0.03 


0.02 


0.01 



Table 11: Predictors are ranged in terms of their CVIM for IHD dataset. 
Thus, the most relevant predictors for IHD are AH, HC and Cx37. 
Myocardial infarction 

Results of RF and SGB methods are given in the following table. 

Table fT2] shows that RF and SGB methods give statistically reliable estimates (predic- 
tion error in the permutation test is close to 50%). Moreover, additional SNP information 
improves predicting ability on 10%. 
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Data 


RF 


SGB 


with SNP data 


0.36/0.497 


0.399/0.53 


without SNP data 


0.473/0.527 


0.482/0.562 



Table 12: Prediction error /prediction error in permutation test calculated via cross-validation 
for MI dataset with employment of RF and SGB methods. 



Cx37 


Sm 


AH 


GpIIIa 


FVII 


FXIII 


HC 


Gpla 


Ob 


IL-6 


PAI-1 


7.5 


2 


1.86 


0.03 


0.02 


^0 


^0 


^0 


^0 


^0 


^0 



Table 13: Predictors are ranged in terms of their CVIM for MI dataset. 

CVIM was calculated according to (p8|) and is given below. 
Thus, the most relevant predictors for MI are Cx37, Sm and AH. 



4 Conclusions and final remarks 

Let us briefly summarize the main results obtained. The analysis of IHD dataset showed 
that two external risk factors out of four considered (AH and HC) have a strong connection 
with the disease risk (the error of classiflcation based on external factors only is 0.25-0.26 
with p- value less than 0.01). Also, the classiflcation based on SNPs only gives a relatively 
low error of 0.28. Moreover, the most influential SNPs are Cx37 and Gpla (FXIII also 
enters the analysis only when AH and HC are present). Prediction error decreases to 
0.13 if both SNP information and external risk factors are taken into account. Note that 
excluding any of the 5 remaining SNPs (all except for two most influential) from data 
increases the error by 0.01-0.02 approximately. So, while the most influential data are 
responsible for the situation within a large part of population, there are smaller parts 
where other SNPs come to effect and provide a more efficient prognosis ( "small subgroups 
effect"). 

The MI dataset gave the following results. The most signiflcant factors of MI risk are 
the Cx37 SNP (more precisely, homozygous mutation) and smoking with a considerable 
gene-environment interaction present. The smallest prediction error of methods applied 
was 0.33-0.35 (with p- value less than 0.01). The classiflcation based on external factors 
only yields a much greater error of 0.42. Thus genetic data improves the prognosis quality 
essentially. While two factors are of great importance, other SNPs considered actually do 
not improve the prognosis essentially, i.e. no small groups effect is observed. 

The conclusions given above are based on several complementary methods of modern 
statistical analysis. These new data mining methods allow to analyze other datasets as 
weh. The study can be continued with larger datasets, in particular, involving new SNP 
data. 
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